CRM114 versus Mr. X: CRM114 Notes for the TREC 2005 Spam Track
Authors: Fidelis Assis (1), William Yerazunis (2), Christian Siefkes (3), Shalendra Chhabra (4)

1: Empresa Brasileira de Telecomunicações – Embratel, Rio de Janeiro, RJ, Brazil
2: Mitsubishi Electric Research Laboratories, Cambridge, MA
3: Database and Information Systems Group, Freie Universität Berlin, Berlin-Brandenburg Graduate School in Distributed Information Systems
4: Computer Science and Engineering, University of California, Riverside, CA
Abstract: This paper discusses the design decisions underlying the CRM114 Discriminator software, how it can be configured as a spam filter, and what we may glean from the preliminary TREC 2005 results. Unlike most other filters, CRM114 is not a fixed-purpose antispam filter; rather, it is a general-purpose language meant to expedite the creation of text filters. The pluggable CRM114 architecture allows rapid prototyping and easy support of multiple classifier engines; rather than testing different cutoff parameters, the CRM114 TREC test set tested different classifier algorithms and learning protocols.
Introduction and History: The CRM114 Discriminator began as a rough design in 1998 for a hypothetical filter meant to categorize incoming mail into a significant number of different "topics" (with spam as an acknowledged target, but by no means the sole target). One of the design assumptions that CRM114 has held from day 0 is that single-word features are not as important as N-feature tuples.
The initial CRM114 test code worked on N-letter tuples rather than N-word tuples, but the groundwork had been laid. Tested against the very poorly camouflaged spam of 1999, the letter-tuple code showed an accuracy in excess of 98%, providing convincing evidence that tuple features are substantially superior to single-token features, and that a "no preconceived notions" machine-learning classifier could converge to accuracy superior to any human-designed heuristic system. Work on CRM114 progressed past the initial EMACS macros and shell scripts that proved the concept to a C-coded multi-class classifier in 2000. With this came the realization that most of the work of filtering and categorizing email was not in the classifier itself; it was in massaging the inputs and outputs to conform to the very large number of individual configurations, mail transfer agents, mail delivery agents, mail user agents, and user desires. Rather than wrap the learning and classification functions inside voluminous shell scripts, we took an initial look at building a Perl module that performed the N-tuple, N-class classification. However, Perl being what it was, the idea of a standalone language made for manipulating the long overlapping strings needed for text classification became much more tenable. The first general public release of CRM114 under the GNU General Public License (GPL) occurred in early 2001, although "friends and family" releases had occurred since the beginning of the project. This initial release has been followed by "whenever it's time" releases ever since. Typically CRM114 major releases occur three to six times a year, with bugfix and "secret new feature" releases being done roughly monthly (or whenever a significant advance occurs). Each release is coded with both a release date and a nickname; the nickname often indicates the major contributor or motivator for that release. New releases have included multiple classifier engines (the current release has seven different classifiers built in) as well as major improvements such as TRAP/FAULT support, a JIT compiler, a preprocessor, and fully 8-bit-safe data paths.

Overview of the CRM114 language: The CRM114 language is described in depth in the manual CRM114 Revealed [Yerazunis 2005], available for free download from the CRM114 web page at http://crm114.sourceforge.net. The language is a command-based language with a declensional (rather than positional) syntax, and only one data type – the overlapping string. All CRM114 functions are designed to operate efficiently on strings that may contain other, partially overlapping strings; this includes efficient reclamation of unused string space. The language contains only overlapping strings; there are no specific constructs for numbers, pointers, continuations, coroutines, open files, or anything else except overlapping strings. The CRM114 language contains specific constructs for string-oriented matching, alteration, iteration, decision, and translation; however, in the current context the most important statements are the LEARN and CLASSIFY statements. LEARN and CLASSIFY accept a uniform set of parameters and properly map those parameters onto whichever classifier is requested. Thus, it is possible to change classifiers in a CRM114 program (or even compare different classifiers, such as a Bayesian and a Winnow, or a Markovian and a Hyperspatial) without having to change the rest of the support structure.
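As a concrete illustration of this uniform-parameter design, here is a minimal sketch in Python (our illustration only; CRM114 itself is written in C, and the two toy engines and all names below are ours, not CRM114's): swapping classifier engines costs one flag change, and the surrounding program is untouched.

```python
from collections import defaultdict

class NaiveCountEngine:
    """Toy engine: score = overlap of the text with each class's learned words."""
    def __init__(self):
        self.counts = defaultdict(lambda: defaultdict(int))

    def learn(self, text, klass):
        for word in text.split():
            self.counts[klass][word] += 1

    def classify(self, text, classes):
        return {k: sum(self.counts[k][w] for w in text.split())
                for k in classes}

class LengthEngine:
    """A second toy engine, to show that swapping engines is one flag change."""
    def __init__(self):
        self.lengths = defaultdict(list)

    def learn(self, text, klass):
        self.lengths[klass].append(len(text))

    def classify(self, text, classes):
        return {k: -abs(len(text) - sum(self.lengths[k]) / len(self.lengths[k]))
                for k in classes if self.lengths[k]}

# Uniform LEARN/CLASSIFY front ends; a flag string selects the engine.
ENGINES = {"counts": NaiveCountEngine(), "length": LengthEngine()}

def learn(text, klass, flags="counts"):
    ENGINES[flags].learn(text, klass)

def classify(text, classes, flags="counts"):
    return ENGINES[flags].classify(text, classes)

learn("buy cheap pills now", "spam")
learn("agenda for the TREC meeting", "good")
print(classify("cheap pills", ["spam", "good"]))   # {'spam': 2, 'good': 0}
```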
Most of the CRM114 classifiers start with the concept of a word tuple as the basic feature to be considered. The tuple is defined as a series of words, and a word is defined as whatever matches a programmer-supplied regex (the default is that a word is one or more printable characters surrounded by whitespace). CRM114 kits include the very efficient TRE approximate regex engine [Laurikari 2001][Laurikari 2004], allowing a large amount of freedom in defining a word.

Structure of a CRM114 classifier: CRM114 classifiers are designed to be "pluggable"; a new classifier can be written in as little as a few hours by using a basic classifier's code as a skeleton. A classifier is given a pointer to the unknown text, the unknown text's length, and a pointer to an argument parameter block containing any flags or word-definition regexes supplied; the classifier returns a condition code and, optionally, a textual representation of the results of the classification. Almost all of the CRM114 classifiers use tuple-based features. That is, as words are tokenized by the regex engine, they are recombined into a series of short phrases of one to five words. The recombination sequence varies with the classifier and with the classifier flags; in newer versions of CRM114, any classifier can be configured to use single words as features with a keyword; in the case of classifiers using Bayes' rule, this results in a Graham-style Bayesian filter (albeit one with an unlimited "peak window"). A more detailed description of the CRM114 classifiers may be found in [Chhabra 2004] and [Siefkes 2004]; we will give only a basic description here.

The original CRM114 classifier was the Markovian classifier; this classifier uses a Sparse Binary Polynomial Hashing (SBPH) feature generator that uses all in-order sequences of up to five words to form feature phrases. For example, the sentence "TREC is sponsored by NIST" will generate the following feature phrases:

TREC
TREC is
TREC sponsored
TREC is sponsored
TREC by
TREC is by
TREC sponsored by
TREC is sponsored by
TREC NIST
TREC is NIST
TREC sponsored NIST
TREC is sponsored NIST
TREC by NIST
TREC is by NIST
TREC sponsored by NIST
TREC is sponsored by NIST

All sixteen of these phrases are then used as features; in the Markovian filter they are weighted unequally, with longer phrases given more weight than short ones [Yerazunis 2004]. The OSB, OSBF, Winnow, and Hyperspace classifiers all use a simpler feature generator which (paradoxically) simultaneously gives higher performance and higher accuracy. This feature generator is called OSB, for Orthogonal Sparse Bigram; it generates only four features instead of sixteen. Using the same example text "TREC is sponsored by NIST", the features are:

TREC is
TREC sponsored
TREC by
TREC NIST

Note that the unigram feature (single words taken one at a time, in isolation) is not used in the OSB feature set; extensive testing shows that the presence of the unigram does not improve results; in fact, it sometimes causes a paradoxical decrease in accuracy.
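Both feature generators are easy to state in code. The following is a sketch in Python of a single window position, anchored on the first word as in the example above (our rendering; CRM114's real implementations are C, and the <skip> markers are our notation for the positional gaps that make the sparse bigrams "orthogonal"):

```python
from itertools import combinations

def sbph_features(window):
    """SBPH: every in-order subsequence of the window that keeps its first
    word -- 2**(len(window)-1) phrases, i.e. 16 for a 5-word window."""
    head, rest = window[0], window[1:]
    feats = []
    for r in range(len(rest) + 1):
        for idx in combinations(range(len(rest)), r):
            feats.append(" ".join([head] + [rest[i] for i in idx]))
    return feats

def osb_features(window):
    """OSB: the window's first word paired with each later word; the gap
    width is part of the feature's identity (shown here as <skip> marks)."""
    head, rest = window[0], window[1:]
    return [" ".join([head] + ["<skip>"] * gap + [w])
            for gap, w in enumerate(rest)]

window = "TREC is sponsored by NIST".split()
print(len(sbph_features(window)))   # 16
for f in osb_features(window):
    print(f)
# TREC is
# TREC <skip> sponsored
# TREC <skip> <skip> by
# TREC <skip> <skip> <skip> NIST
```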
After generating the features, a lookup is done into the stored statistics files. In order to accelerate this lookup, CRM114 does not use a general-purpose database such as MySQL or PostgreSQL. Instead, the feature is converted to a 64-bit hash, and that hash value is used as the starting index for searching an in-memory hash table index. On average, this lookup takes only six adjacent 32-bit reads from memory (assuming that the feature hasn't been seen before and isn't in the CPU L1 cache). Almost all CRM114 classifiers use this hashing scheme to accelerate feature lookups. As a secondary effect, the actual text is not stored; this gives a modicum of security to the statistics files (although individual features would fall quickly to a dictionary attack, the sequences would be more difficult, and of course it would be impossible to recover any particular learned text with certainty).

Once the individual feature lookups are performed, the counts are multiplied by the appropriate feature weight, and an appropriate set of conversion and combining rules is used to convert individual feature frequencies into a final judgment. In Markovian, OSB, and Bayesian operation, which by default use a Bayes-rule-like combining law, the conversion from frequency of occurrence to a priori probability is:

    P_local = 0.5 + (in_class_occurrences - out_of_class_occurrences) / (16 * (total_occurrences + 1))

Note that this produces a local probability that is strongly biased toward 0.5 (that is, toward uncertainty). These probabilities are weighted and combined via this modified Bayes chain rule to yield the output probabilities:

    P_final(class_i) = (P_initial(class_i) * P_local(class_i)) / SUM over all classes j of (P_initial(class_j) * P_local(class_j))

These probabilities are then normalized. Additionally, two manipulations are performed: to avoid floating-point underflow, the value of the largest probability is renormalized at each step from the sum of the smaller probabilities, and for the purposes of human-readable output, the result is converted to pR. pR is the base-10 log of the probability ratio and, given IEEE floating-point inputs, varies from -340 to +340. This allows expression of a very large range of numbers in human-readable terms. CRM114 also has some support for a chi-squared decision rule; however, that support was not used in the current NIST TREC test set.
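To make these combining rules concrete, here is a small runnable sketch for the two-class case (the counts are hypothetical and the variable names ours; this omits the per-feature weights, but includes the per-step renormalization that keeps the chain out of floating-point underflow):

```python
import math

def p_local(in_count, out_count):
    """Per-feature local probability, strongly biased toward 0.5."""
    total = in_count + out_count
    return 0.5 + (in_count - out_count) / (16.0 * (total + 1))

def combine(feature_counts):
    """Two-class Bayes-chain combining with per-step renormalization.
    feature_counts: list of (in_class, out_of_class) count pairs."""
    p = [0.5, 0.5]                        # uniform initial probabilities
    for c_in, c_out in feature_counts:
        local = [p_local(c_in, c_out), p_local(c_out, c_in)]
        joint = [p[0] * local[0], p[1] * local[1]]
        norm = sum(joint)                 # renormalizing at every step
        p = [j / norm for j in joint]     # avoids underflow on long texts
    return p

p_in, p_out = combine([(8, 1), (5, 0), (0, 3)])
pR = math.log10(p_in / p_out)   # base-10 log of the probability ratio
print(round(pR, 3))             # positive pR favors the in-class
```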
Filter Configurations As Tested: The four filter configurations for CRM114 were OSBF, Winnow, OSB Unique, and "plain" OSB. Because CRM114 is a language, not just a filter, changing between these four variants only requires setting a parameter flag on the CLASSIFY and LEARN statements. The version of CRM114 used was CRM114-20050518.BlameMercury (except for the Winnow version, which uses the Java implementation created by Christian Siefkes, available at http://www.inf.fu-berlin.de/inst/ag-db/software/tie/, because of a suspected minor bug in the CRM114 implementation that causes very slight discrepancies between what should be identical programs). Additionally, all four variants enabled some form of "microgrooming", a method of aging old data out of the statistics files to keep the statistics files from growing without bound. Microgrooming is based on the concept that, when it is necessary to make room in the database file, then given two seldom-seen features, one old and one new, the older one should be discarded first, as it has a longer history of not being useful. Because CRM114's hash storage structure uses linear in-table overflow chaining, this information is indirectly available as the distance at which each feature is actually found, compared to its nominal location (i.e., the further down the overflow chain we actually find a feature, compared to where it would have been placed if no overflow were present, the newer the feature is). The Java Winnow system used exact explicit aging rather than this implicit approximate aging; the results are similar. All four filter configurations use a phrase-creation window of length 5 and OSB-style features. By using these N-word tuples, OSB features provide a Markov Random Field model, rather than a Bayesian model, of the texts learned [Chhabra 2004].

None of the CRM114 filters use any preprocessing whatsoever. There is no stop-word removal, no tagging of items in the headers versus those in the body, and no decoding of any type, not even MIME decoding. All four classifiers operated on the "as seen" texts, without any further interpretation, and the default word-defining regex was used (specifically, [[:graph:]]+ defines a word), except in the Java Winnow code, which used the X tokenizer from [Siefkes 2004]; the X tokenizer is XML/HTML-aware and is expressed, using Unicode categories, as:

    [^\p{Z}\p{C}][/!?#]?[\p{L}\p{M}\p{N}]*(?:["'=;]|/?>|:/*)?

Descriptions of individual filters: We will now describe the four configurations tested.

OSBF: OSBF is a variant of OSB that uses confidence factors that vary depending on the number of documents learned. The confidence factor is used to influence the local per-feature probability, and distance-weighting of the features is used, with an inverse-exponential weighting as distance increases. Beyond that, OSBF uses Bayes' rule to combine the local feature probabilities (a gross abuse of Bayes' independence assumptions that most filter authors simply live with). Unlike most Bayesian spam filters, there is no "top 10" window of the most extreme values; all values are used, no matter how insignificant. Unfortunately, the version of OSBF used in these tests contained a significant bug in the chain rule; hence the values reported below are much worse than the algorithm is capable of. The training was configured so that if a text did not score at least +10 pR units in the correct class, the text was trained, whether or not it had actually been classified correctly by some narrow margin. The first 500,000 characters of each text were used.

WINNOW: WINNOW is an implementation of Nick Littlestone's Winnow algorithm. Winnow is comparable to a back-propagating perceptron: individual features start out with a weight of 1.0000 and are multiplicatively promoted or demoted according to whether the feature is found or not found in a particular text being learned. Instead of Bayes' rule, WINNOW uses a simple summing of the weights; the class with the highest weight count wins. Our Winnow implementation uses a thick threshold for learning: documents are trained not only in case of mistakes, but also if the determined score was near the threshold. This makes the classifier more robust when classifying borderline instances. See [Siefkes 2004] for a detailed description.

OSB and OSB Unique: OSB and OSB Unique use the basic local probability formula above without confidence factors; like OSBF, they use all local probability values (there is no extremum window). Feature counts are weighted by an empirically derived decreasing sequence. There is only one difference between OSB and OSB Unique: whether a repeated feature is allowed to further influence the output, or is simply discarded. Among all tested classifiers, only OSB uses all repeated features; OSB Unique discards repeated features, as do OSBF and Winnow. In both OSB and OSB Unique, a single-sided thick threshold for training of +20 pR units was used; any text not scoring at least +20 pR units in the correct class was trained. Only the first 4096 characters of each text were used for classification; this is the default for one of the more commonly deployed CRM114 mail filters.
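As a sketch of that single-sided thick-threshold rule (our code and names; the same idea, with different thresholds, drives the OSBF and Winnow training described above):

```python
TRAIN_THRESHOLD_PR = 20.0        # +20 pR for OSB/OSB Unique; OSBF used +10

def should_train(pr_in_correct_class: float,
                 threshold: float = TRAIN_THRESHOLD_PR) -> bool:
    """Train on every error (pR <= 0) and also on every "thin" correct
    classification (0 < pR < threshold)."""
    return pr_in_correct_class < threshold

print(should_train(7.0))    # True  -> LEARN this text anyway
print(should_train(35.0))   # False -> confidently correct; skip
```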
Analysis of the TREC preconference partial-release results: The following four tables show how CRM114 fared with respect to the preconference data release. 1-ROCAC% and average LAM% were pre-published only as median values; final LAM% was not pre-published at all. (Note that, unlike some filters, none of the CRM114 configurations contained any preloaded learning; everything was learned strictly on the fly.) Despite this "no preload", all of CRM114's classifiers beat both the pre-publication median 1-ROCAC% and LAM%, often by an order of magnitude.

The design of most CRM114 classifiers assumes a "balanced error" configuration: all misclassifications are equally bad. Although common wisdom is that a false reject is more dangerous than a false accept, at least one author considers this incorrect. Consider the situation of a well-done phishing spam being falsely accepted, and then believed by a non-technical computer user (say, a grandmother); the likely resulting financial costs of this one false accept will approach, if not exceed, the likely financial costs of a false reject of a typical legitimate financial notification. Thus, to a first approximation, we consider all misclassifications roughly equally costly, and an equal-error-rate configuration is a good loss-minimization guideline. To examine the behavior of filters operated in this equal-error configuration, we consider two "sweet spots" that our experience shows are reasonably representative of equal-error operation: specifically, what the spam error rate is when the good-email error rate is fixed at 1%, and what the good-email error rate is when the spam error rate is fixed at 1%. These representative "sweet spots" are shown as the last two columns in the charts. Each "sweet spot" column header shows three values – the best value for any filter tested, the median value, and the worst value. A sweet-spot entry in bold type highlights where CRM114's performance was either "best in class" or "statistically indistinguishable at the 95% confidence level from best in class" over the entire set of 44 filter configurations tested. As shown below, for every corpus, and for every "sweet spot", at least one of the four CRM114 configurations was either "best" or statistically indistinguishable from "best". Interestingly, no one configuration of CRM114 was consistently best. This may well be a manifestation of the No Free Lunch theorem in decision and optimization [Wolpert 1997].

MRX Corpus: The MRX (Mr. X) corpus is a "single person" corpus sponsored by Prof. Gordon Cormack of the University of Waterloo. It is biased toward spam: it contains about 9K good emails and 40K spams (about 18% good email). Here we see that Winnow, operating with the OSB feature set, was statistically indistinguishable in performance from the best of the 44 filter configurations submitted to TREC. It also had an impressively small 1-ROCAC% of just 0.051.

MRX corpus. Column medians over all 44 filters: 1-ROCAC% median = 1.613; Avg LAM% median = 2.53. Sweet-spot best/median/worst: Ham 1.0% = 0.34 / 24.24 / 99.65; Spam 1.0% = 0.37 / 13.35 / 99.16. (95% confidence intervals in parentheses.)

Config       1-ROCAC%              Avg LAM%           Final LAM%         Ham 1.0%           Spam 1.0%
OSBF         0.311 (0.244-0.397)   1.46 (1.34-1.59)   1.4  (1.13-1.62)   3.85 (2.82-5.22)   2.85 (2.49-3.27)
OSB Winnow   0.051 (0.035-0.075)   0.60 (0.53-0.68)   0.24 (0.17-0.33)   0.43 (0.31-0.62)   0.51 (0.37-0.69)
OSB Unique   0.177 (0.128-0.246)   0.98 (0.85-1.13)   0.23 (0.12-0.42)   1.07 (0.86-1.35)   1.07 (0.83-1.39)
OSB          0.218 (0.157-0.304)   0.79 (0.69-0.90)   0.34 (0.23-0.52)   0.63 (0.50-0.79)   0.56 (0.42-0.75)
FULL corpus: The FULL corpus is a fairly balanced corpus. It contains 39K good emails and 53K spams (42% good email). In this corpus, plain OSB (that is, a tuple-based Bayesian filter using no prefiltering or other "gimmicks") was the best filter tested across all TREC submissions, for both sweet spots. Interestingly, omitting repeated features degraded filter performance by a barely statistically significant margin (the error bars still overlap, but neither central value is covered by the opposite configuration's error bars). Note that in this corpus OSB beats OSB Unique in everything but 1-ROCAC% (0.049 versus 0.042) – keep this in mind as you read the next corpus.

FULL corpus. Column medians over all 44 filters: 1-ROCAC% median = 0.861; Avg LAM% median = 2.02. Sweet-spot best/median/worst: Ham 1.0% = 0.23 / 10.35 / 94.44; Spam 1.0% = 0.15 / 14.20 / 98.11.

Config       1-ROCAC%              Avg LAM%           Final LAM%         Ham 1.0%           Spam 1.0%
OSBF         0.169 (0.151-0.190)   1.74 (1.66-1.83)   3.6  (3.28-3.97)   3.14 (2.81-3.50)   3.15 (2.94-3.38)
OSB Winnow   0.122 (0.102-0.145)   0.73 (0.68-0.79)   0.86 (0.72-1.02)   0.68 (0.59-0.79)   0.51 (0.42-0.62)
OSB Unique   0.042 (0.031-0.056)   0.63 (0.56-0.70)   0.43 (0.33-0.56)   0.35 (0.30-0.42)   0.21 (0.16-0.27)
OSB          0.049 (0.035-0.068)   0.47 (0.43-0.52)   0.37 (0.29-0.47)   0.23 (0.19-0.27)   0.15 (0.11-0.20)

SB Corpus: The SB corpus is a smaller corpus, with 6K good emails and only 775 spams (88% good); thus it is heavily weighted toward filters with a prejudice to accept borderline emails as good. Both OSB and OSB Unique did very well in this corpus – both are statistically indistinguishable from the best of the 44 filters submitted to TREC. However, OSB Unique seems to have a slight edge in 1-ROCAC%, LAM%, and final LAM%. Note that this is an inversion from the FULL corpus, where OSB was better than OSB Unique.

SB corpus. Column medians over all 44 filters: 1-ROCAC% median = 2.845; Avg LAM% median = 4.79. Sweet-spot best/median/worst: Ham 1.0% = 3.48 / 27.23 / 100.00; Spam 1.0% = 3.79 / 63.76 / 99.52.

Config       1-ROCAC%              Avg LAM%           Final LAM%         Ham 1.0%             Spam 1.0%
OSBF         2.393 (1.689-3.382)   2.34 (1.87-2.93)   0.91 (0.56-1.47)   8.77 (7.05-10.88)    97.11 (57.03-99.88)
OSB Winnow   1.888 (1.315-2.704)   2.14 (1.68-2.73)   1.2  (0.73-1.88)   10.32 (8.32-12.74)   62.35 (37.84-81.84)
OSB Unique   0.231 (0.142-0.377)   1.64 (1.28-2.10)   0.59 (0.34-1.03)   3.48 (2.26-5.32)     4.01 (2.40-6.63)
OSB          0.393 (0.203-0.759)   1.67 (1.32-2.12)   0.44 (0.24-0.80)   3.74 (2.32-5.97)     5.62 (1.99-14.86)

TM Corpus: The TM corpus is also highly biased toward non-spam emails: it contains 150K good emails and 19K spams (88% good email). Here OSB is statistically clearly better than OSB Unique; the error bars don't even overlap. What's also of interest is that Winnow beats OSB in 1-ROCAC% and LAM%, and ties it in final LAM%, yet the 95% confidence-interval error bars for OSB and Winnow barely touch, with OSB being "better" in both sweet spots (apparently equal to the best of all 44 filters submitted to TREC).

TM corpus. Column medians over all 44 filters: 1-ROCAC% median = 2.712; Avg LAM% median = 2.54. Sweet-spot best/median/worst: Ham 1.0% = 1.04 / 10.28 / 99.06; Spam 1.0% = 1.07 / 47.70 / 99.40.

Config       1-ROCAC%              Avg LAM%           Final LAM%         Ham 1.0%             Spam 1.0%
OSBF         0.790 (0.720-0.868)   2.44 (2.33-2.56)   2.5  (2.28-2.80)   10.28 (9.71-10.89)   16.93 (13.00-21.74)
OSB Winnow   0.166 (0.138-0.201)   0.79 (0.74-0.85)   0.46 (0.39-0.54)   1.30 (1.15-1.46)     1.86 (1.34-2.57)
OSB Unique   0.195 (0.160-0.238)   1.08 (1.01-1.14)   0.68 (0.59-0.78)   1.58 (1.40-1.78)     1.71 (1.49-1.97)
OSB          0.272 (0.225-0.329)   0.83 (0.77-0.89)   0.46 (0.39-0.54)   1.04 (0.92-1.16)     1.07 (0.86-1.34)
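As an aside for readers unfamiliar with the measure: LAM% is not defined in this paper; under the TREC 2005 spam track's definition it is the logistic average misclassification percentage, the inverse logit of the mean of the logits of the ham and spam misclassification rates. A small sketch with made-up rates:

```python
import math

def logit(p):
    return math.log(p / (1.0 - p))

def inv_logit(x):
    return 1.0 / (1.0 + math.exp(-x))

def lam(ham_misc, spam_misc):
    """Logistic average of the two misclassification rates (as fractions);
    it rewards balanced errors on both sides."""
    return inv_logit((logit(ham_misc) + logit(spam_misc)) / 2.0)

print(100 * lam(0.005, 0.02))   # ham 0.5%, spam 2.0% -> LAM% about 1.0
```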
Conclusions: As can be seen, one or more of CRM114's classifiers was either the best, or statistically indistinguishable from the best, in all eight sweet spots (4 corpora x 2 sweet spots per corpus). However, interesting inversions occurred in measures such as 1-ROCAC% and rate of learning. It seems that the No Free Lunch theorem [Wolpert 1997] is alive and well, with a vengeance, in the spam-filtering world; statistically significant variation exists with even simple changes, such as whether or not to allow repeated features as classifier input. The design decision to make CRM114 a "learning" system rather than a "prelearned" system seems to be justified; at least one CRM114 classifier learned fast enough to win every 1% sweet spot in every corpus. The design decision to use a hash-based statistics system rather than a database seems to have been justified; because the hash-based system is so fast, CRM114 developers can feasibly test hundreds of variations where other developers can only test two or three different ideas. Finally, the design decision to make CRM114 a language rather than a single-purpose spam-filter tool seems to have been justified: the language allows easy creation of multiple coexisting variants that can be tested against one another easily.